This library finds the best-performing configuration from a set of dimensions (e.g. schemas, partition, storage) that can be specified inside the settings.yaml file of the resource¶

In [ ]:
%pip install PAPyA==0.1.0

Load the configuration file and log file locations for the experiment¶

Configurations for SP2Bench Data

In [1]:
config_sp2bench = "settings.yaml" # config file location
logs_sp2bench = "log" # logs file location

Configurations for Watdiv Data

In [2]:
config_watdiv = "settings_watdiv.yaml" # config file location
logs_watdiv = "log_watdiv" # logs file location

Configuration file

The configuration file is written in the YAML data-serialization language and has two main parts: the dimensions and the number of queries in the experiment. You can add more dimensions here or change the existing ones to anything you need.

Example :

dimensions:
    schemas: ["st", "vt", "pt", "extvt", "wpt"]
    partition: ["horizontal", "predicate", "subject"]
    storage: ["csv", "avro", "parquet", "orc"]

query: 11
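The file above can be parsed with standard YAML tooling; a minimal sketch (assuming PyYAML is installed, which PAPyA's own configuration handling implies):

```python
import yaml  # PyYAML

# The same settings shown above, inlined here for a self-contained example
settings = """
dimensions:
    schemas: ["st", "vt", "pt", "extvt", "wpt"]
    partition: ["horizontal", "predicate", "subject"]
    storage: ["csv", "avro", "parquet", "orc"]

query: 11
"""

config = yaml.safe_load(settings)
print(config["dimensions"]["schemas"])  # ['st', 'vt', 'pt', 'extvt', 'wpt']
print(config["query"])                  # 11
```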

Log file structure

The names of the log files must follow the order of dimensions in the configuration file (i.e. {schemas}.{partition}.{storage}.txt), and the subfolders should be the ranking sets of the experiment (e.g. dataset sizes).

Example :

UI Module
└───log
    │
    |───100M
    |    │   st.horizontal.csv.txt
    |    │   st.horizontal.avro.txt
    |    │   ...
    │
    └───250M
        |   st.horizontal.csv.txt
        │   st.horizontal.avro.txt
        │   ...
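The naming convention above is the Cartesian product of the dimension options in configuration-file order; a quick sketch that enumerates the expected file names (using the example dimensions from settings.yaml):

```python
from itertools import product

# Dimension options as declared in settings.yaml; insertion order matters
# because the filename pattern is {schemas}.{partition}.{storage}.txt
dimensions = {
    "schemas": ["st", "vt", "pt", "extvt", "wpt"],
    "partition": ["horizontal", "predicate", "subject"],
    "storage": ["csv", "avro", "parquet", "orc"],
}

# One log file per configuration, per ranking set (e.g. per dataset size)
expected_logs = [".".join(combo) + ".txt" for combo in product(*dimensions.values())]

print(len(expected_logs))  # 5 * 3 * 4 = 60 files per ranking set
print(expected_logs[0])    # st.horizontal.csv.txt
```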

Single Dimensional Ranking¶

SDRank is a class from the PAPyA library that calculates a ranking score R for each dimension independently, operating over the log structure the user specified in the configuration file.
The value of R represents the performance of a particular configuration (a higher value means a better-performing configuration). We use the Ranking Function R below to calculate the rank scores:

$$R =\sum \limits _{r=1} ^{d} \frac{O_{dim} * (d-r)}{|Q| * (d-1)}, 0<R<=1$$

$d$ : total number of parameters (options) in a particular dimension
$O_{dim}$ : number of occurrences of the dimension option placed at rank $r$ (Rank 1, Rank 2, Rank 3, ...)
$|Q|$ : total number of queries
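The Ranking Function can be sketched in a few lines (an illustrative helper, not part of the PAPyA API; here $d$ is taken as the number of rank positions in the occurrence table):

```python
def ranking_score(occurrences, num_queries):
    """Ranking Function R: occurrences[r-1] is how often the configuration
    lands at rank r (1-based); d is the number of rank positions."""
    d = len(occurrences)
    return sum(o * (d - r) for r, o in enumerate(occurrences, start=1)) / (num_queries * (d - 1))

# Worked example: a configuration with 15 first places, 2 second places,
# and 0 third places over |Q| = 17 queries
print(round(ranking_score([15, 2, 0], 17), 6))  # 0.941176
```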

PAPyA.Rank.SDRank¶

class Rank.SDRank(config_path, log_path, ranking_sets, dimension)¶

Parameters:
  config_path : str
  Specify the path to your configuration file(s), e.g. ./UIModule/settings_watdiv.yaml
  log_path : str
  Specify the path to your log file(s), e.g. ./UI Module/log_watdiv
  ranking_sets : str
  Ranking set of the user's choice, e.g. a dataset size (100M)
  dimension : str
  A single dimension to be ranked, e.g. schemas

In [3]:
# this class takes a single dimension and a dataset size (ranking set) as parameters, matching the user's log file structure
from Rank import SDRank

schemaSDRank_100M = SDRank(config_watdiv, logs_watdiv, '100M', 'schemas')
schemaSDRank_250M = SDRank(config_watdiv, logs_watdiv, '250M', 'schemas')
schemaSDRank_500M = SDRank(config_watdiv, logs_watdiv, '500M', 'schemas')
partitionSDRank_100M = SDRank(config_watdiv, logs_watdiv, '100M', 'partition')
partitionSDRank_250M = SDRank(config_watdiv, logs_watdiv, '250M', 'partition')
partitionSDRank_500M = SDRank(config_watdiv, logs_watdiv, '500M', 'partition')
storageSDRank_100M = SDRank(config_watdiv, logs_watdiv, '100M', 'storage')
storageSDRank_250M = SDRank(config_watdiv, logs_watdiv, '250M', 'storage')
storageSDRank_500M = SDRank(config_watdiv, logs_watdiv, '500M', 'storage')

Rank.SDRank.calculateRank¶

SDRank.calculateRank(*args)¶

The function that automates calculating the rank scores of a single dimension using the Ranking Function above.

Returns a table of configurations sorted by Ranking Score, from best-performing downwards, along with the number of occurrences of each configuration at rank _r_ (1st, 2nd, 3rd, ...)

Parameters:
  *args : str or list
  This method takes an arbitrary number of string and list parameters.
   str -> slices the table according to the string input, e.g. "predicate" slices the table by predicate partitioning
   list -> removes queries from the ranking calculation, e.g. [7,8,9] removes queries 7, 8, and 9

In [4]:
exclude_query = [1,2,3]
In [5]:
schemaSDRank_100M.calculateRank(exclude_query).head()
Out[5]:
Rank 1 Rank 2 Rank 3 Result
pt.horizontal.parquet 15.0 2.0 0.0 0.941176
pt.subject.parquet 13.0 3.0 1.0 0.852941
pt.horizontal.orc 13.0 2.0 2.0 0.823529
st.predicate.orc 9.0 8.0 0.0 0.764706
pt.subject.orc 9.0 7.0 1.0 0.735294
In [6]:
schemaSDRank_250M.calculateRank(exclude_query).head()
Out[6]:
Rank 1 Rank 2 Rank 3 Result
pt.horizontal.parquet 16.0 1.0 0.0 0.970588
pt.horizontal.orc 14.0 3.0 0.0 0.911765
pt.subject.parquet 14.0 3.0 0.0 0.911765
pt.subject.orc 12.0 5.0 0.0 0.852941
st.predicate.orc 10.0 7.0 0.0 0.794118
In [7]:
schemaSDRank_500M.calculateRank(exclude_query).head()
Out[7]:
Rank 1 Rank 2 Rank 3 Result
pt.subject.parquet 15.0 2.0 0.0 0.941176
pt.horizontal.parquet 14.0 3.0 0.0 0.911765
pt.horizontal.orc 13.0 4.0 0.0 0.882353
vp.predicate.orc 12.0 5.0 0.0 0.852941
pt.subject.orc 12.0 4.0 1.0 0.823529
In [8]:
partitionSDRank_100M.calculateRank(exclude_query).head()
Out[8]:
Rank 1 Rank 2 Rank 3 Result
vp.predicate.orc 16.0 1.0 0.0 0.970588
vp.predicate.parquet 15.0 1.0 1.0 0.911765
pt.horizontal.orc 11.0 5.0 1.0 0.794118
st.subject.orc 10.0 7.0 0.0 0.794118
st.subject.parquet 9.0 8.0 0.0 0.764706
In [9]:
partitionSDRank_250M.calculateRank(exclude_query).head()
Out[9]:
Rank 1 Rank 2 Rank 3 Result
pt.horizontal.parquet 12.0 5.0 0.0 0.852941
st.predicate.parquet 13.0 2.0 2.0 0.823529
st.subject.orc 10.0 7.0 0.0 0.794118
pt.horizontal.orc 9.0 8.0 0.0 0.764706
pt.subject.orc 8.0 8.0 1.0 0.705882
In [10]:
partitionSDRank_500M.calculateRank(exclude_query).head()
Out[10]:
Rank 1 Rank 2 Rank 3 Result
st.predicate.parquet 16.0 1.0 0.0 0.970588
pt.subject.parquet 13.0 4.0 0.0 0.882353
pt.subject.orc 13.0 4.0 0.0 0.882353
vp.horizontal.parquet 12.0 5.0 0.0 0.852941
st.predicate.orc 10.0 7.0 0.0 0.794118
In [11]:
storageSDRank_100M.calculateRank(exclude_query).head()
Out[11]:
Rank 1 Rank 2 Result
st.subject.orc 17.0 0.0 1.000000
st.horizontal.orc 16.0 1.0 0.941176
st.predicate.orc 12.0 5.0 0.705882
vp.subject.orc 11.0 6.0 0.647059
pt.predicate.parquet 10.0 7.0 0.588235
In [12]:
storageSDRank_250M.calculateRank(exclude_query).head()
Out[12]:
Rank 1 Rank 2 Result
st.subject.orc 17.0 0.0 1.000000
st.horizontal.orc 17.0 0.0 1.000000
vp.subject.orc 13.0 4.0 0.764706
vp.predicate.orc 12.0 5.0 0.705882
pt.horizontal.parquet 12.0 5.0 0.705882
In [13]:
storageSDRank_500M.calculateRank(exclude_query).head()
Out[13]:
Rank 1 Rank 2 Result
st.horizontal.orc 17.0 0.0 1.000000
st.subject.orc 17.0 0.0 1.000000
vp.predicate.orc 15.0 2.0 0.882353
pt.horizontal.orc 14.0 3.0 0.823529
vp.subject.orc 14.0 3.0 0.823529

Rank.SDRank.plotRadar¶

SDRank.plotRadar()¶

Ranking over one dimension is insufficient when multiple dimensions matter. The presence of trade-offs reduces the accuracy of single-dimension ranking functions, as can be seen in the radar plots below.

This method returns a radar chart that exposes these trade-offs: optimizing by a single-dimension ranking criterion reduces the accuracy on the other dimensions.

In [14]:
# This example shows that the top configuration when ranking by schema is optimized towards its own dimension only, ignoring the other two dimensions.
from Rank import SDRank
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plotRadar()
In [15]:
SDRank(config_watdiv, logs_watdiv, '100M', 'partition').plotRadar()
In [16]:
SDRank(config_watdiv, logs_watdiv, '100M', 'storage').plotRadar()

Rank.SDRank.plot¶

SDRank.plot(view)¶

In addition to the radar plot, PAPyA also provides a visualization of the rank scores of the options of a single dimension of the user's choice.

This method returns a bar chart of a particular dimension's rankings, pivoting over another dimension according to the user's chosen viewing projection.

Parameters:
  view : str
  A string naming the dimension option the user chooses to view as the projection

Example:
  _Schemas_ single dimensional ranking viewed from _Predicate Partitioning_ by pivoting over the _Storage_ dimension

In [18]:
from Rank import SDRank

# example of schema dimension plots
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plot('predicate')
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plot('horizontal')
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plot('subject')
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plot('orc')
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plot('parquet')
Out[18]:
<AxesSubplot:title={'center':'Schemas SD Rank pivoting Partition formats for Parquet Storage'}>

Rank.SDRank.plotBox¶

SDRank.plotBox(q = None)¶

To show the distribution of the query runtime data, we use a box plot to compare the queries in our experiment. A box plot provides information at a glance, giving a general picture of the data.

This method returns a box plot showing the maximum and minimum runtimes of each individual query for a particular single-dimensional ranking.

Parameters:
  q : list
  An optional list of column names to restrict the plot to, for a more focused box plot, e.g. ["Q1", "Q2", "Q3"] outputs a box plot for those queries only

In [20]:
from Rank import SDRank

# Box plot example of all queries of schema ranking dimension
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plotBox()
SDRank(config_watdiv, logs_watdiv, '100M', 'partition').plotBox()
SDRank(config_watdiv, logs_watdiv, '100M', 'storage').plotBox()

# Box plot example of query 1,2,3 of schema ranking dimension
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plotBox(["Q1", "Q2", "Q3"])

Multi Dimensional Ranking¶

Given the trade-offs exposed by the single-dimensional ranking function, we propose an optimization technique that finds the non-dominated solutions (configuration combinations) by optimizing all dimensions at the same time, utilizing the NSGA2 algorithm.
In this experiment, we provide two ways to use the NSGA2 algorithm:

  • The first method, paretoAgg, operates on the single-dimensional ranking criteria and aims to maximize the performance of the three ranks altogether.
  • The second method, paretoQ, applies the algorithm to the rank sets obtained by sorting each query's results individually, aiming to minimize the query runtimes of the ranked dimensions.
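The notion of non-dominance underlying both methods can be sketched as follows (an illustrative example with made-up runtimes, not the library's NSGA2 implementation: a configuration dominates another when it is at least as good on every objective and strictly better on at least one; for paretoQ the objectives would be per-query runtimes, lower is better):

```python
def dominates(a, b):
    """True if point a dominates point b under minimization."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the non-dominated points."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical per-query runtimes (seconds) for three configurations
runtimes = {
    "pt.horizontal.parquet": (1.2, 0.8, 2.1),
    "st.horizontal.orc":     (1.5, 1.1, 2.4),  # dominated by the row above
    "pt.subject.orc":        (1.0, 1.3, 1.9),  # trade-off: not dominated
}
front = pareto_front(list(runtimes.values()))
print(len(front))  # 2 non-dominated configurations
```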

PAPyA.Rank.MDRank¶

class Rank.MDRank(config_path, log_path, ranking_sets)¶

Parameters:
  config_path : str
  Specify the path to your configuration file(s), e.g. ./UIModule/settings_watdiv.yaml
  log_path : str
  Specify the path to your log file(s), e.g. ./UI Module/log_watdiv
  ranking_sets : str
  Ranking set of the user's choice, e.g. a dataset size (100M)

In [21]:
from Rank import MDRank

# example of MDRank class with 100M dataset size as ranking set of the experiment
multiDimensionRank_100M = MDRank(config_watdiv, logs_watdiv, '100M')
multiDimensionRank_250M = MDRank(config_watdiv, logs_watdiv, '250M')
multiDimensionRank_500M = MDRank(config_watdiv, logs_watdiv, '500M')

Rank.MDRank.paretoQ¶

MDRank.paretoQ()¶

This method returns a table of configuration solutions, along with the configurations they dominate, according to the Non-dominated Sorting Genetic Algorithm II (NSGA2) applied to minimize the query runtimes of the ranked dimensions.

In [22]:
multiDimensionRank_100M.paretoQ().head()
Out[22]:
Solution Dominated
0 pt.horizontal.parquet st.horizontal.orc
1 pt.subject.parquet st.horizontal.parquet
2 pt.horizontal.orc pt.predicate.parquet
3 pt.subject.orc pt.predicate.orc
4 st.subject.orc
In [23]:
multiDimensionRank_250M.paretoQ().head()
Out[23]:
Solution Dominated
0 pt.horizontal.parquet st.subject.parquet
1 pt.subject.orc st.horizontal.orc
2 pt.subject.parquet st.horizontal.parquet
3 pt.horizontal.orc pt.predicate.orc
4 vp.horizontal.parquet pt.predicate.parquet
In [24]:
multiDimensionRank_500M.paretoQ().head()
Out[24]:
Solution Dominated
0 pt.subject.orc st.predicate.orc
1 pt.subject.parquet vp.predicate.parquet
2 pt.horizontal.orc st.predicate.parquet
3 pt.horizontal.parquet st.horizontal.orc
4 vp.subject.orc st.subject.parquet

Rank.MDRank.paretoAgg¶

MDRank.paretoAgg()¶

This method returns a table of configuration solutions, along with the configurations they dominate, according to the Non-dominated Sorting Genetic Algorithm II (NSGA2) applied to the single-dimensional ranking criteria, maximizing the performance of all rankings altogether.

In [25]:
multiDimensionRank_100M.paretoAgg().head()
Out[25]:
Solution Dominated
0 st.predicate.orc st.horizontal.parquet
1 vp.predicate.parquet pt.predicate.orc
2 pt.horizontal.orc pt.predicate.parquet
3 pt.subject.parquet vp.horizontal.orc
4 vp.predicate.orc vp.subject.parquet
In [26]:
multiDimensionRank_250M.paretoAgg().head()
Out[26]:
Solution Dominated
0 st.subject.orc st.horizontal.parquet
1 pt.horizontal.parquet pt.predicate.parquet
2 pt.predicate.orc
3 st.subject.parquet
4 st.horizontal.orc
In [27]:
multiDimensionRank_500M.paretoAgg().head()
Out[27]:
Solution Dominated
0 vp.predicate.orc st.horizontal.parquet
1 st.subject.orc pt.predicate.parquet
2 st.predicate.parquet st.subject.parquet
3 vp.subject.orc pt.predicate.orc
4 pt.subject.parquet vp.predicate.parquet

Rank.MDRank.plot¶

MDRank.plot()¶

This method returns a plot of the multi-dimensional ranking solutions according to paretoAgg, shown as shades of green projected in a three-dimensional space.

In [33]:
multiDimensionRank_100M.plot()
# multiDimensionRank_250M.plot()
multiDimensionRank_500M.plot()
(7, 3) (11, 3)
(7, 3) (11, 3)

Ranking Criteria Validation¶

This library provides two metrics to evaluate the goodness of the ranking criteria: conformance and coherence.

  • Conformance measures the adherence of the top-ranked configurations to the actual query positioning of those configurations. We calculate conformance according to the equation below:
$$A(R^k) = 1 - \sum \limits _{i=0} ^{|Q|} \sum \limits _{j=0} ^{k} \frac {\bar{A}(i,j)}{|Q|*k}$$

Consider a ranking $R_{s}$ whose top-3 ranked configurations are $\{c_{1},c_{2},c_{3}\}$, and suppose that in query $|Q|$ these overlap only with the bottom-3 ranked configurations, $\{c_{4},c_{2},c_{5}\}$. For example, $c_{2}$ is in the $59^{th}$ position out of 60 positions.
Thus, $A(R^k) = 1- \frac {1}{(11*3)}$, when $k = 3$ and $|Q| = 11$
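The arithmetic of this example can be sketched as follows (an illustrative reading of the formula, not PAPyA's implementation; here a top-k configuration whose actual position in a query is worse than a threshold h is counted as a violation):

```python
def conformance(rank_positions, k, h, num_queries):
    """Conformance sketch: rank_positions[i][j] is the actual position of
    the j-th top-k configuration in query i; positions worse than the
    threshold h count against the score."""
    violations = sum(1 for query in rank_positions for pos in query if pos > h)
    return 1 - violations / (num_queries * k)

# The example above: 11 queries, k = 3; only c2 falls to position 59
# in the last query, past the threshold; every other entry conforms.
positions = [[1, 2, 3]] * 10 + [[1, 59, 3]]
print(round(conformance(positions, k=3, h=28, num_queries=11), 4))  # 0.9697, i.e. 1 - 1/33
```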

  • Coherence is the measure of agreement between two ranking sets that use the same ranking criteria across different experiments. We use Kendall's index to calculate coherence, which counts the number of (dis)agreements between two ranking sets.
$$K(R_{1}, R_{2}) = \sum \limits _{{i,j} \epsilon P} ^{} \frac {\bar{K}_{i,j}(R_{1}, R_{2})}{|P|}$$

In this experiment, we assume that the rank sets are the dataset sizes (e.g. 100M and 250M). Kendall's distance is computed between two rank sets $R_{1}$ and $R_{2}$, where $|P|$ represents the set of unique pairs of distinct elements in the two sets. For instance, the $K$ index between $R_{1}=\{c_{1},c_{2},c_{3}\}$ and $R_{2}=\{c_{1},c_{2},c_{4}\}$ for 100M and 250M is 0.33, i.e. one disagreement out of three pair comparisons.
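For two rankings over the same elements, the normalized Kendall distance can be sketched as below (an illustrative helper, not part of PAPyA; the library's treatment of rank sets containing non-overlapping configurations, as in the example above, may differ):

```python
from itertools import combinations

def kendall_distance(r1, r2):
    """Fraction of element pairs whose relative order disagrees between
    two rankings of the same elements."""
    pos1 = {c: i for i, c in enumerate(r1)}
    pos2 = {c: i for i, c in enumerate(r2)}
    pairs = list(combinations(r1, 2))
    disagreements = sum(
        1 for a, b in pairs
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )
    return disagreements / len(pairs)

# One discordant pair (c2, c3) out of three unique pairs -> 1/3
print(round(kendall_distance(["c1", "c2", "c3"], ["c1", "c3", "c2"]), 2))  # 0.33
```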

PAPyA.Ranker.Conformance¶

class Ranker.Conformance(config_path, log_path, ranking_sets, conformance_set, k, h)¶

Parameters:
  config_path : str
  Specify the path to your configuration file(s), e.g. ./UIModule/settings_watdiv.yaml
  log_path : str
  Specify the path to your log file(s), e.g. ./UI Module/log_watdiv
  ranking_sets : str
  Ranking set of the user's choice, e.g. a dataset size (100M)
  conformance_set : list
  List of ranking criteria whose conformance scores to compute.
  k : int
  Size of the top-k subset of the ranking criteria, e.g. _k_ = 5 takes the top 5 of each ranking criterion in the conformance set
  h : int
  Threshold counted towards the conformance score, e.g. _h_ = 28 takes queries that rank below 28 out of all configurations

In [45]:
from Ranker import Conformance

conformance_set = ['schemas', 'partition', 'storage', 'paretoQ', 'paretoAgg']
conf = Conformance(config_watdiv, logs_watdiv, '250M', conformance_set, 2, 9)

Ranker.Conformance.run¶

Conformance.run()¶

This method returns a table of conformance scores for each ranking criterion the user specifies in the conformance_set, using the k and h values in the formula.

In [47]:
# conformance scores for all ranking criteria at the 250M dataset size
conf.run()
Out[47]:
250M
schemas 0.950
partition 0.675
storage 0.275
paretoQ 0.975
paretoAgg 0.775

Ranker.Conformance.configurationQueryRanks¶

Conformance.configurationQueryRanks(dimension, mode)¶

This method returns the criteria table of a chosen ranking dimension; the criteria table shows the rank value of each query for the top-k configurations.

Parameters:
  dimension : str
  The user's chosen dimension for which to view the criteria table

  mode : 0 (default) or 1
  0 shows the criteria table with its raw rank values
  1 shows the criteria table as True/False against the threshold (True if the rank value is higher than h)

In [46]:
conf.configurationQueryRanks(dimension = 'paretoAgg', mode = 1)
Out[46]:
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
pt.horizontal.parquet False False False False False False False False False False False False False False False False False False False False
st.subject.orc True True False False True False True False True True True True True False False False False False False False

PAPyA.Ranker.Coherence¶

class Ranker.Coherence(config_path, log_path, conformance_set)¶

Parameters:
  config_path : str
  Specify the path to your configuration file(s), e.g. ./UIModule/settings_watdiv.yaml
  log_path : str
  Specify the path to your log file(s), e.g. ./UI Module/log_watdiv
  conformance_set : list
  List of ranking criteria whose Kendall index scores between ranking sets (e.g. dataset sizes) are computed.

In [48]:
from Ranker import Coherence

coherence_set = ['schemas', 'partition', 'storage', 'paretoQ', 'paretoAgg']
coh = Coherence(config_watdiv, logs_watdiv, coherence_set)

Ranker.Coherence.run¶

Coherence.run(rankset1, rankset2)¶

This method returns a table of coherence scores for each ranking criterion the user specifies in the conformance_set, calculated as the number of (dis)agreements between two ranking sets.

Parameters:
  rankset1 : str
  the _first_ ranking set the user wants to compare

  rankset2 : str
  the _second_ ranking set the user wants to compare

In [49]:
# example of coherence scores for all ranking criteria, comparing the 100M ranking set with the 500M ranking set
coh.run('100M', '500M')
Out[49]:
Kendall's Index
schemas 0.333333
partition 0.483660
storage 0.614379
paretoQ 0.637363
paretoAgg 0.533333

Ranker.Coherence.heatMap¶

Coherence.heatMap(rankset1, rankset2, dimension)¶

This method returns a heat map showing the coherence between two ranking sets of the user's choice; the heat map is sorted by the best-performing configurations of the first ranking set.

Parameters:
  rankset1 : str
  the _first_ ranking set the user wants to compare

  rankset2 : str
  the _second_ ranking set the user wants to compare

  dimension : str
  The user's chosen dimension for which to view the heat map

In [50]:
coh.heatMap('100M', '250M', dimension='paretoQ')

Ranker.Coherence.heatMapSubtract¶

Coherence.heatMapSubtract(*args, dimension)¶

This method shows the coherence differences between the ranking sets of the user's choice, with the first ranking set as the pivot point (100M-250M and 100M-500M in the example below). The heat map is sorted by the best-performing configurations of the first ranking set.

Parameters:
  *args : str
  takes an arbitrary number of ranking sets; the first ranking set is the pivot point for all the others

  dimension : str
  The user's chosen dimension for which to view the heat map

In [51]:
coh.heatMapSubtract('100M', '250M', '500M', dimension='paretoAgg')